Abstract: Script detection systems that can handle several languages are
becoming increasingly necessary. Machine learning and deep learning have
substantially eased the task of identifying scripts written in various
languages. Machine learning techniques have used Naive Bayes and Support
Vector Machines (SVM) for language detection, while this paper reviews
several deep learning approaches that cover a range of methodologies,
including LSTM and BERT. However, the accuracy and scalability of
multilingual systems still need improvement. The primary focus of the present
investigation is therefore the development of a framework capable of
recognizing scripts in a variety of languages. The technique also applies
pattern analysis to mixed-script queries. The study establishes a scalable,
efficient, and adaptive approach to increase the accuracy of identifying a
large number of languages. Accuracy, recall, and F1-score are among the
performance metrics calculated to evaluate the proposed multilingual script
identification. The results indicate that the proposed approach offers an
efficient and scalable solution for the detection of multilingual scripts.


Introduction 

Information Retrieval is a field of computer science that 
focuses on satisfying users' information needs through IR 
systems. As the Internet is increasingly filled with content 
in languages like Hindi, Marathi, Tamil, and others, the 
ability to access information in multiple languages has 
become essential in our globally interconnected society 
(Shekhar and Sharma, 2020; Ojo et al., 2022; Gupta et al., 
2014; Khan and Sawarkar, 2024). The diversity of 
languages poses a challenge to effective communication in 
the digital age. Consequently, research in Information 
Retrieval has gained significant importance in recent 
years. One of the major challenges in cross-lingual and 
multilingual information retrieval is obtaining sufficient 
data when a query is launched in a local language. With
the expansion of the World Wide Web, the amount of
online content available in languages other than English is 
increasing. Users would greatly benefit from IR systems 
that can deliver relevant results in English and local 
languages. 

When text in one language is written in a different script, the spelling of
words often deviates from standard rules and instead follows pronunciation.
Transliteration is the phonetic rendering of words from a language in a
non-native or unfamiliar script (Karimi et al., 2011; Patel and Parikh, 2020;
Kumar and Lehal, 2023; Dey et al., 2024). On the internet, the Roman alphabet
is increasingly used for generating content and for finding information.
Before other natural language processing (NLP) techniques are applied, the
data needs to undergo pre-processing, which may include translation and/or
transliteration. Transliteration thus supports machine translation (MT) and
cross-lingual information retrieval (CLIR).

Transliteration can be approached in two different ways. The first is
forward transliteration, which occurs when native words are written in a
foreign script. For example, the Hindi term जीवन (written in Devanagari
script) means "life" in English and can be transliterated as jivan, Jeevan,
jeeivan, or various other versions. Back-transliteration, on the other hand,
converts a word from a non-native script back to its original script. In this
case, "Jivan" would be back-transliterated to its original Devanagari form.
While back-transliteration must reproduce the same original word, forward
transliteration offers the transliterator more freedom. Karimi et al. (2011)
conducted extensive research, and their seminal survey still summarizes
machine transliteration well. In recent years, multilingual social media
posts have increased, making it harder for IR systems to process and retrieve
pertinent texts.
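The many-to-one relationship between Roman spellings and a native word can be illustrated with a small lookup. The following Python sketch is not part of the original study: the table `BACK_TRANSLITERATION` and both function names are hypothetical, and the three spellings are the variants mentioned above, mapped here to the Devanagari form जीवन.

```python
from typing import List, Optional

# Hypothetical lookup table: Roman-script spelling variants -> native word.
# The entries are illustrative only and are not the paper's knowledge base.
BACK_TRANSLITERATION = {
    "jivan": "जीवन",
    "jeevan": "जीवन",
    "jeeivan": "जीवन",
}


def back_transliterate(roman_word: str) -> Optional[str]:
    """Return the native-script form of a Roman-script word, if it is known."""
    return BACK_TRANSLITERATION.get(roman_word.lower())


def forward_variants(native_word: str) -> List[str]:
    """Return every known Roman spelling of a native-script word, sorted."""
    return sorted(
        roman for roman, native in BACK_TRANSLITERATION.items()
        if native == native_word
    )
```

Back-transliteration has a single correct answer per word, so a dictionary lookup is enough for this sketch; forward transliteration returns a list because several spellings are acceptable.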


Related Work 
Various natural language processing (NLP) applications, including code-mixed
language classification, have been addressed and improved using a variety of
ML methods and neural networks. According to Sristy et al. (2017), Feurer and
Hutter (2019), and Chaitanya et al. (2018), "code mixing" occurs when the
vocabulary and syntax of two or more languages are mixed together in a single
sentence. The term also covers cases where two languages are spoken at the
same time. Code mixing occurs most often in casual circumstances, reflecting
the speakers' propensity to switch languages while communicating, with both
languages used concurrently in all grammatical and lexical components.
Shekhar et al. (2020), Thara and Poornachandran (2018) and Patel and
Bhattacharyy (2019) proposed methods for determining the language of
bilingual text, evaluated on Facebook, Twitter, and WhatsApp datasets. Some
quantum LSTM network subclasses proficiently learned and predicted language
in social media material. Regardless of the exact Hamiltonian form, the
results show that ML techniques have a lot of room to grow in quantum
dynamics.

Ansari et al. (2021) carried out an extensive experiment using transfer
learning and fine-tuning of BERT models to identify the language used in
Twitter data. The study used a dataset of code-mixed texts in Hindi, English,
and Urdu for pre-training and for word-level language classification. The
pre-trained code-mixed representations outperformed monolingual ones.

Other work places its primary emphasis on identifying mixed scripts within a
dataset that includes Roman Urdu, Hindi, Saraiki, Bengali and English (Yasir
et al., 2021; Naosekpam and Sahu, 2023). To train the language identification
model, the researchers utilised RNN and word vectorisation approaches.
Moreover, they enhanced numerous model structures, including BGRU, GRU,
bidirectional LSTM (Sasidhar et al., 2020; Anand et al., 2022) and long
short-term memory. The study attained a high performance score through
experimentation. Roman-English word-styling, generative spellings, and
phonetic typing are only a few of the multilingual difficulties explored in
the study.

The language of documents was successfully identified word by word in
code-mixed English, Bodo Assamese, and other languages (Mosa, 2020; Ojo et
al., 2022). To analyse and predict the language of Facebook-sourced content,
the researchers used a variety of categorisation approaches. The models'
word-level language detection accuracy varied because they were trained on
the code-mixed corpus utilising features based on n-grams and dictionaries.
Building upon Conditional Random Fields (CRF), the method of Thara and
Poornachandran (2018) allows for word-level language detection in code-mixed
text. This method relies on lexical, contextual, character n-gram, and unique
character properties, making it applicable to a wide range of languages.
Across a variety of language pairs, the experimental results show that the
CRF-based method consistently outperforms the alternatives. Researchers used
datasets of chat conversations written in a combination of English-Bengali
and English-Hindi to identify word-by-word language transitions (Dutta et
al., 2015). The authors evaluated the system's performance in several
languages and created a code-mixing index to measure the amount of language
blending in the corpora.
Standard transliterations sometimes include the interchange of certain
characters, and Sarma et al. (2018) presented various ways to learn this
sequence. Using the given transliterations as examples, the researchers
demonstrated how these algorithms outperformed competing methods in
identifying Hindi words. Their experimental model considers the language
along with the part of speech of nearby words while attempting to identify
languages at the word level. Experimental findings show that the proposed
model achieves better accuracy than prior methods. An approach to language
identification in code-mixed and multi-script texts that uses syllable and
character n-gram features was proposed by Shashirekha et al. (2022) to
improve ML classifiers. The suggested models were tested with three Dravidian
language pairs: Malayalam and English, Tamil and English, and Kannada and
English. The ML classifiers' output showed that code-mixed and multi-script
texts might be better analysed with the addition of syllable and character
n-gram features.
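Character n-gram features of the kind referred to here can be extracted in a few lines. The following Python sketch is illustrative only: the function name `char_ngrams`, the boundary markers `^` and `$`, and the default n-gram sizes are assumptions rather than details reported by the cited studies, and syllable features are not covered.

```python
from typing import List


def char_ngrams(word: str, n_min: int = 2, n_max: int = 3) -> List[str]:
    """Return all character n-grams of a word for n in [n_min, n_max].

    The word is lower-cased and padded with '^' and '$' so that prefixes
    and suffixes produce distinct n-grams (an assumed convention).
    """
    padded = "^" + word.lower() + "$"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the padded word.
        for start in range(len(padded) - n + 1):
            grams.append(padded[start:start + n])
    return grams
```

Short function words shared between languages produce very few n-grams, which is one reason such features are usually combined with context from neighbouring words.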

To identify the language of words in code-mixed data, Mandal and Singh (2018)
developed a novel framework for language tagging using a multichannel neural
network that integrates CNN with LSTM (Shekhar et al., 2018; Jitta et al.,
2017; Kozhirbayev et al., 2018; Shanmugalingam et al., 2018; Velankar et al.,
2022). The multichannel neural network showed good results in language
identification when combined with a Bi-LSTM-CRF context capture module,
thanks to this architecture's integration of contextual information.


According to literature, the above machine transliteration systems encounter
several challenges, including:

1. Script Requirements: Determining the appropriate script for
transliterating a particular word or name can be complex, especially when
dealing with multilingual texts where multiple scripts may be used.

2. Sound Gaps: Some languages may have sounds that do not exist in the target
language, leading to difficulties in phonetic representation during
transliteration.

3. Transliteration Variations: Different transliteration variations may exist
for the same word or name, resulting in inconsistencies in the
transliteration process.

4. Language of Origin: Identifying the language of origin of a word or name
is crucial for accurate transliteration. However, in code-mixed or
multilingual texts, this task can be challenging.
Table 1. Summary of state-of-the-art models/approaches.

Year | Author | MT Model / Approach | Strength | Research Gap
2022 | Velankar et al., 2022 | DL based approaches, naive Bayes, SVM | Identifying and categorising hate speech on Twitter and Facebook databases | Opinions on certain subjects shift across time
2022 | Chakravarthi et al., 2022 | Machine learning and deep learning | The language used by David in the coded sample | Limited resource dataset for other Dravidian languages
2021 | Ravikiran and Annamalai, 2021 | Multilingual BERT and DistilBERT model | Database for (DOSA) used for code-mixed text used in Tamil and English | Findings from less complex models, such as LSTM-CRF and its derivatives, are omitted
2020 | Shekhar et al., 2020 | BiLSTM | Determine the programming language | Performing an analysis of brief textual material; contradictory information
2019 | Shashirekha et al., 2022 | Machine learning | Recognise hate speech and detect offensive language | Discontinued in mixcode
2018 | Sharma and Mittal, 2018 | OOVTTM model | Improving word combinations found in dictionaries | Words pertaining to named objects have been translated incorrectly
2016 | Palangi et al., 2016 | RNN | Learning useful for words with semantic meaning | Proficiency with the subject area's lexicon is necessary
2015 | Raghavi et al., 2015 | SVM | Sorting social forum topics into categories based on their languages | Normalisation of term variation in code-mixed data exists
2015 | Roy et al., 2015 | Grapheme-cooccurrence, corpus matching for MLM | Intent word detection | Not well-suited for multiple-word
2014 | Gella et al., 2014 | Word identification, SVM | Transliterate or translate | Problems with transliteration, Badha, Badhaa, Barha, and others
5. Translation vs. Transliteration: Deciding whether a name or word should be
translated or transliterated (or a combination of both) requires careful
consideration and context-aware processing.

Addressing these difficulties is essential for improving machine
transliteration systems' accuracy and effectiveness.

Proposed Framework for Language Detection and Pattern Analysis in Mixed
Script Queries

As shown in Figure 1, the proposed framework identifies the language from
mix-code text, processes it and returns the intention of that query or
sentence.


Figure 1. A Framework for Pattern Analysis and Intent Identification in Mixed
Script Queries. (Blocks: User; Input (mixed script); Identify the Language;
Convert into English/Roman Script; Preprocess Input Query Statement;
Knowledge Base (Region of Intent) (Trained Data); Identify Region of Intent;
Apply Deep Learning & Recurrent Neural Network; Predicted Intention.)

In this model, the user submits mix-code scripts/sentences, and the language
identifier finds the language of all keywords/tokens of the user's
scripts/sentences with the help of a knowledge base and converts them into
English/Roman script. After preprocessing, the script is placed in a
particular region of intent with the help of a knowledge base (trained
dataset), and deep learning with an RNN is applied to predict the intention.

Language Identification

The script that users enter could be mixed code or multilingual. A label
sequence is created using word-level classification, and a Bidirectional LSTM
(Kazi et al., 2020; Mandl et al., 2020; Mabokela, 2019) is used for
sentence-level classification. From a problem-solving standpoint, however,
the whole input is in Roman letters. To train a word-level classifier to
understand both native and English terms, we employ words from each source,
namely Hindi, English, and the knowledge base. The sentence-level language
identification method is exclusive to the English language. To improve the
word-level classifier's accuracy, the labelling sequence is utilised. When
the classifier is unsure about the label of a word, this usually helps with
the labelling process. Mislabeling occurs due to overlap between the two
languages' shorter words. In such a case, the label of the word that comes
before or after it can provide useful context for identifying the language of
the word in question. The language identification procedure is depicted in
Figure 2.

Algorithm: Procedure Language Identification ()
{
    Input: mixCodeScript (string)
    // Read mix-code script/sentence/query from the user

    vectorSequence = BiDirectionalLSTM(mixCodeScript)
    // Generate sequence of vectors using Bi-Directional LSTM

    sentenceClassification = SentenceLevelClassification(vectorSequence)
    // Apply sentence level classification

    wordClassificationInput =
        PrepareWordClassificationInput(vectorSequence, sentenceClassification)
    // Forward to word level classification process

    labeledSequence =
        WordLevelClassification(wordClassificationInput, knowledgeBase)
    // Apply word level classification with knowledge base

    unilingualOutput = Transliterate(labeledSequence)
    // Transliterate the output of word level classification into
    // unilingual (English/Roman)

    Output(unilingualOutput) // Output the result
}

Int. J. Exp. Res. Rev., Vol. 43: 214-228 (2024)

Figure 2. Flow diagram for language identification. (Elements: mixed code
input; Recurrent Neural Network (Bi-Directional LSTM); sentence level
classification; word-language pairs such as {W1: Hi}, {W2: Eng}; Uni-code
output.)

Figure 3. Process to classify the words.

Figure 3 illustrates the process of word classification. The words are
categorized into three lists (Hindi, English, and Other), and if the list of
other words is not empty, it undergoes the same process once more. This
involves converting all possible words into the Hindi or English lists using
KB_ABB and the knowledge base. Furthermore, context analysis is employed for
ambiguous words, resulting in a set of words paired with their corresponding
languages.

The structure of KB_ABB, as shown in Table 2, supports the word classifier in
training the system to understand abbreviations and short-length words.
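The three-list partition and the second pass through KB_ABB can be sketched as follows. This Python fragment is only an illustration under stated assumptions: the word sets `HINDI_KB` and `ENGLISH_DICT` are tiny hypothetical stand-ins for the knowledge base and the English dictionary, `KB_ABB` holds three abbreviations that appear in Table 2, the function names are invented for the example, and a word found in both resources is simply assigned to Hindi rather than sent to context analysis.

```python
from typing import Dict, List, Tuple

# Hypothetical miniature resources; the real knowledge base is not listed here.
HINDI_KB = {"main", "aaj", "kal", "khush", "hu"}
ENGLISH_DICT = {"market", "today", "fine", "great", "by", "the", "way"}
KB_ABB = {"f9": "fine", "gr8": "great", "btw": "by the way"}


def partition(words: List[str]) -> Tuple[List[str], List[str], List[str]]:
    """Split words into (Hindi, English, Other) lists by dictionary lookup."""
    hindi, english, other = [], [], []
    for word in words:
        key = word.lower()
        if key in HINDI_KB:
            hindi.append(word)
        elif key in ENGLISH_DICT:
            english.append(word)
        else:
            other.append(word)
    return hindi, english, other


def classify_with_abbreviations(sentence: str) -> Dict[str, List[str]]:
    """Partition a sentence, then re-process the Other list through KB_ABB."""
    hindi, english, other = partition(sentence.split())
    unresolved = []
    for word in other:
        expansion = KB_ABB.get(word.lower())
        if expansion is None:
            unresolved.append(word)  # stays in Other after the second pass
            continue
        # Classify the expanded form with the same partition step.
        h, e, o = partition(expansion.split())
        hindi.extend(h)
        english.extend(e)
        unresolved.extend(o)
    return {"Hindi": hindi, "English": english, "Other": unresolved}
```

For the sample sentence "BTW main kal aa jayuga", the abbreviation is expanded and classified as English, while the two words missing from the toy dictionaries remain in the Other list.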


Table 2. Representation of KB_ABB for abbreviations.

English/Roman Script Words | List of Abbreviations
FINE | F9 -> 5N -> FYN
GREAT | gr8
FINE BY ME | FBN
For Your Information | FYI
To be Honest | TBH
Did you know | DYK
By the Way | BTW
As soon as Possible | ASAP
Oh My God | OMG
Night | NI8
See |
Brother |
Be |
Before |
Best Friend Forever |
End of Day |
See you tomorrow |

The structure of the Knowledge Base shown in Table 3 supports the word
classifier in training the system to understand native words.

Table 3. Hash table representation of the Knowledge Base.

Base Words | Similar possibilities
khushbuu | khushbu

Algorithm: Word Level Classification:
Word_level_language_detection(mixed_code)
{
    STEP 1: Split mixed code into words
        List<words> words = tokenization(mixed_code)

    STEP 2: Find the confidence level of each word with different vocabularies
        For all words of list
            If (word_i ∈ {knowledgeBase, engDictionary})
                List<word_i, confidence, language>
                    listHindi = detectLanguage(word_i, knowledgeBase);
                List<word_i, confidence, language>
                    listEng = detectLanguage(word_i, engDictionary);
            Else
                List<word_i, confidence, language>
                    listOther = detectLanguage(word_i, engDictionary);
            End If
        End Loop
        If (listOther is notEmpty)
            update(List<words> words, KB_ABB);
            Repeat STEP 2;

    STEP 3: Take the language with the maximum confidence value as the final
    language of each word
        For all listHindi and listEng
            If (listHindi<confidence> != listEng<confidence>)
                List<word, language> language =
                    max(listHindi<confidence>, listEng<confidence>);
            Else
                // Words with equal confidence are ambiguous and need
                // context analysis
                List<word, language> language =
                    contextAnalysis(listHindi, listEng)
            End If
        End Loop
}
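Step 3 of the algorithm above, together with the context-analysis fallback, can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the confidence tables `HINDI_CONF` and `ENGLISH_CONF` contain invented values, `detect` is a hypothetical stand-in for `detectLanguage`, and the neighbour rule in `context_analysis` (previous word first, then next word) is an assumption based on the description of using surrounding labels.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical confidence tables; values are invented for illustration.
HINDI_CONF: Dict[str, float] = {"aaj": 1.0, "main": 0.9, "hai": 0.9, "is": 0.5}
ENGLISH_CONF: Dict[str, float] = {
    "today": 1.0, "market": 1.0, "my": 0.9, "birthday": 1.0,
    "main": 0.4, "is": 0.5,
}


def detect(word: str, table: Dict[str, float]) -> float:
    """Stand-in for detectLanguage: confidence of a word in one vocabulary."""
    return table.get(word.lower(), 0.0)


def context_analysis(index: int, labels: List[Optional[str]]) -> str:
    """Resolve a tie from the nearest labelled neighbour (assumed rule)."""
    for j in (index - 1, index + 1):
        if 0 <= j < len(labels) and labels[j] in ("Hi", "En"):
            return labels[j]
    return "Amb"  # no usable neighbour: the word stays ambiguous


def word_level_language_detection(sentence: str) -> List[Tuple[str, str]]:
    words = sentence.split()
    labels: List[Optional[str]] = []
    for word in words:
        hi, en = detect(word, HINDI_CONF), detect(word, ENGLISH_CONF)
        if hi == 0.0 and en == 0.0:
            labels.append("Oth")                      # in neither vocabulary
        elif hi != en:
            labels.append("Hi" if hi > en else "En")  # maximum confidence wins
        else:
            labels.append(None)                       # tie: resolve by context
    for i, label in enumerate(labels):
        if label is None:
            labels[i] = context_analysis(i, labels)
    return list(zip(words, labels))
```

Ties are resolved in a second pass so that every unambiguous label is already available when a neighbour is consulted.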
Experimental Evaluation
The experiments were conducted on a dataset containing various mixed scripts
used on different social media platforms. Every word in this dataset has been
tagged with one of two languages, and the text also contains mixed code,
numbers, digits, and special symbols. The text was collected from social
media. Twenty scripts are retained for classification after preprocessing.
Table 4 lists each script together with its number. Table 5 summarises the
sample dataset annotated at sentence level, and Table 6 gives the word-level
annotations that were obtained from the sentence-level annotations.
Table 4. Sample dataset: scripts taken from various social media apps and data sets taken from MSIR.


1 main aaj main market jaunga 

2 Tum bahut dust ho 

3 Tum sab log aajao 

4 Aaj main khush hu becoz today is my birthday 

5 BTW main kal aa jayuga 

6 Howru 

7 Im f9 

8 Tere Suit Ke Re Saare Re Colour Baawali Tere Aage Saari Chhori Sai Blur Baawali 
9 Today is my Birthday. Or Mai Bahut hee khush hun 

10 Kya tum is restaurant main ek table book karne main meri help karoge 
11 Taj Mahal is in India. Ye Bahut hi khoobsurat hai 

12 Log Bol rhe hai Jaishah ne world cup ki team khareed le hai 

13 Code deploy hone may abhi time lagega 

14 University ne abhi students ki marksheet nhi send ki hai 

15 Mera resume abhi updated nhi hai. 

16 Aajkal sabi Paytm use kar rhe hai. 

17 Mujhe Bank may paise deposit karne hai. 

18 Morning may sabhi ko walk karni chahiye. 

19 Hello may bol rhi hu How r u. 
20 Teacher ne sabhi Topic cover karwa diye hai. 


The performance of the proposed system is measured through accuracy,
precision, recall, and F-measure, which are defined in equations 1, 2, 3 and
4, respectively. Accuracy is an important performance metric: it is the ratio
of correctly predicted observations (TP+TN) to all observations
(TP+TN+FP+FN), as given in equation (1).

\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{1} \]

Precision (P) is the ratio of relevant retrieved words (TP) to all retrieved
words (TP and FP), as given in equation (2).

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{2} \]

Recall (R) is the ratio of relevant retrieved words (TP) to all relevant
words available (TP and FN), as given in equation (3).

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{3} \]

F-measure is the harmonic mean of precision and recall, as given in equation
(4). It therefore takes both false positives and false negatives into
account.

\[ \text{F-measure} = \frac{2 \times (\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}} \tag{4} \]
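Equations (1) to (4) translate directly into code. The following Python sketch is an illustration rather than the evaluation script used in the study; the function name `classification_metrics` and the zero-division guards are assumptions, and the counts in the usage note are hypothetical because the paper does not report raw TP, FP, FN and TN values.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F-measure from raw counts.

    Each ratio falls back to 0.0 when its denominator is zero (an assumed
    convention; the source does not discuss this case).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0            # equation (1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0           # equation (2)
    recall = tp / (tp + fn) if (tp + fn) else 0.0              # equation (3)
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0  # equation (4)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }
```

As a usage example, the hypothetical counts TP=3, FP=2, FN=0, TN=2 give a precision of 0.6, a recall of 1.0, an accuracy of 5/7 and an F-measure of 0.75.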


Table 5. Description of sample dataset annotated at sentence level.

Language | #Sentences | Avg length
Hindi (hn) | 3 | 9 words
English (en) | 2 | 3 words
MixedCode (mc) | 15 | 9 words
Total | 20 |


Table 6. Explanation of word-level annotations acquired through
sentence-level annotations.

Status | Language | Total Words
Resolved | Hindi (hn) | 61
Resolved | English (en) | 49
Unresolved | | 44
Total | | 154

Using Algorithm 1 (language identification) and Algorithm 2 (word-level
classification), the language of each word is identified for the scripts in
the dataset given in Table 4.




Table 7. Confidence level/probability of each word with different vocabulary.

Sentence ID   Word   Lang. Detected   Probability
main 1.00000 
Aaj 1.00000 
main 1.00000 
Market 0.99804 
Jaunga 1.00000 
0.99999 

1.00000 

0.71428 

0.71428 

0.99999 

1.00000 

0.99999 

1.00000 

Aaj 1.00000 
Main 0.99999 
Khush 1.00000 
hu 0.85714 
becoz 1.00000 
Today 1.00000 
Is 0.99999 
my 0.71428 
birthday 1.00000 
1.00000 

1.00000 

0.85714 

1.00000 

0.99999 

0.71428 

1.00000 

0.99999 

0.99999 

1.00000 

0.85768 

Tere Hi 0.99999 
Suit Hi 1.00000 
Ke Oth 0.99999 
Re Hi 1.00000 
Saare Hi 1.00000 
Re Hi 0.99999 
Colour En 0.71777 
Baawali Hi 1.00000 
Tere Hi 0.99999 
Aage Hi 1.00000 
Saari Hi 0.99999 
Sai Amb 0.85714 
Blur En 0.85714 
Baawali Hi 1.00000 
Today Hi 1.00000
is Oth 0.99999 

my En 0.85714 
Birthday En 0.85714 
9 Or Amb 1.00000 
Mai Hi 1.00000 
Bahut Hi 1.00000 
hee Hi 1.00000 
khush Hi 0.85714 
Hun Hi 0.99999 
Kya i 1.00000 
tum 1.00000 
Oth 0.99999 
restaurant 0.85558 
main 0.99999 
ek 0.99999 
table 0.85714 
Book 1.00000 
0.99999 

1.00000 

0.85684 

1.00000 

0.85714 

1.00000 

1.00000 

0.99999 

0.99999 

1.00000 

1.00000 

1.00000 

0.99999 

khoobsurat 1.00000 
Hai 0.46875 
Log En 0.98956 
Bol Oth 0.64093 
Rhe Hi 0.98828 
Hai Hi 0.46875 
Jaishah Hi 0.80107 
Ne Oth 0.21705 

12 World En 1.00000 
Cup En 1.00000 
Ki Oth 0.55078 
Team En 1.00000 
Khareed Hi 1.00000 
Le Oth 0.35573 

hai Hi 0.46875 
Code En 1.00000 
0.87206 

13 0.98047 
0.78506 

1.00000 

1.00000 

University en 1.00000 
Ne Oth 0.21705
Abhi Amb 0.78506 
students En 0.98654 
Ki Oth 0.55078 
marksheet Hi 0.40933 
Nhi Hi 0.41016 
Send En 1.00000 
Ki Oth 0.55078 
Hai Hi 0.46875 
Mera 0.70000 
Resume 1.00000 
Abhi 0.78506 
updated 1.00000 
Nhi i 0.41016 
i i 0.46875 
0.89960 
0.98047 
0.94271 
1.00000 
0.52713 
0.98828 
0.46875 
Mujhe i 1.00000 
Bank 0.94282 
May 0.98047 
Paise 0.76802 
deposit 0.98949 
Karne 0.58984 
Hai i 0.46875 
Morning 1.00000 
May 0.98047 
Sabhi i 1.00000 
Ko 0.35433 
Walk 1.00000 
Karni 0.71354 
chahiye i 0.98764 
1.00000 
0.98047 
0.64093 
0.79346 
0.75196 
1.00000 
0.93519 
0.76133 

Oth 


1.00000 
1.00000 
0.91016 
0.84942 
0.46875 
Word-level classification still shows inaccuracies, even though the
identification of mixed languages is accurate. When dealing with shorter
words, the surrounding word labels play a crucial role in determining the
language of the current word. In this system, KB_ABB and the knowledge base
assist in assigning words to the specified languages. The frequencies of
Hindi, English, other, ambiguous (Amb) and abbreviation (Abb) words in a
script are calculated as a proportion of the total words.

The summarized word-level results for the aforementioned categories are
presented in Table 8, and a graphical representation of the summary can be
found in Figure 4. In the proposed system, all words identified as
abbreviations (Abb) utilize KB_ABB, while ambiguous (Amb) words are detected
using a Knowledge Base designed to return words based on user context.
Precision, recall, F-measure, and accuracy for word-level identification are
evaluated using equations 1, 2, 3 and 4. The outcomes are detailed in Table
9, and a graphical representation is depicted in Figure 5.

Table 8. Sentence/script-wise word-level identification.

Sentence ID | Total Words | Hindi | English | Accuracy | Others | Amb | Abb
1 | 5 | 2 | 1 | 0.6 | 0 | 2 | 0
2 | 4 | 3 | 0 | 0.8 | 0 | 1 | 0
3 | 4 | 3 | 1 | 1.0 | 0 | 0 | 0
4 | 9 | 3 | 4 | 0.8 | 1 | 1 | 0
5 | 5 | 3 | 0 | 0.6 | 0 | 2 | 0
6 | 3 | 0 | 1 | 0.3 | 0 | 0 | 2
7 | 3 | 0 | 1 | 0.3 | 0 | 0 | 2
8 | 15 | 10 | 3 | 0.9 | 1 | 1 | 0
9 | 10 | 6 | 2 | 0.8 | 1 | 1 | 0
10 | 13 | 6 | 4 | 0.8 | 1 | 2 | 0
11 | 10 | 5 | 1 | 0.6 | 1 | 3 | 0
12 | 13 | 5 | 4 | 0.7 | 4 | 0 | 0
13 | 7 | 1 | 3 | 0.6 | 0 | 3 | 0
14 | 10 | 2 | 4 | 0.6 | 3 | 1 | 0
15 | 6 | 2 | 3 | 0.8 | 0 | 1 | 0
16 | 7 | 3 | 2 | 0.7 | 2 | 0 | 0
17 | 7 | 2 | 4 | 0.9 | 1 | 0 | 0
18 | 7 | 2 | 4 | 0.9 | 1 | 0 | 0
19 | 8 | 1 | 4 | 0.6 | 1 | 0 | 2
20 | 8 | 2 | 3 | 0.6 | 3 | 0 | 0

Figure 4. Accuracy at word level identification. (Plot of accuracy against
sentence/script ID.)

Table 9. Precision, Recall, F-Measure and Accuracy at word level identification.

Sen_ID | Precision | Recall | Accuracy | F-measure
1 | 0.60000 | 1.00000 | 0.71429 | 0.75000
2 | 0.75000 | 1.00000 | 0.80000 | 0.85714
3 | 1.00000 | 1.00000 | 1.00000 | 1.00000
4 | 0.77778 | 0.87500 | 0.72727 | 0.82353
5 | 0.60000 | 1.00000 | 0.71429 | 0.75000
6 | 0.33333 | 1.00000 | 0.60000 | 0.50000
7 | 0.33333 | 1.00000 | 0.60000 | 0.50000
8 | 0.86667 | 0.92857 | 0.82353 | 0.89655
9 | 0.80000 | 0.88889 | 0.75000 | 0.84211
10 | 0.76923 | 0.90909 | 0.75000 | 0.83333
11 | 0.60000 | 0.85714 | 0.64286 | 0.70588
12 | 0.69231 | 0.69231 | 0.52941 | 0.69231
13 | 0.57143 | 1.00000 | 0.70000 | 0.72727
14 | 0.60000 | 0.66667 | 0.50000 | 0.63158
15 | 0.83333 | 1.00000 | 0.85714 | 0.90909
16 | 0.71429 | 0.71429 | 0.55556 | 0.71429
17 | 0.85714 | 0.85714 | 0.75000 | 0.85714
18 | 0.85714 | 0.85714 | 0.75000 | 0.85714
19 | 0.62500 | 0.83333 | 0.63636 | 0.71429
20 | 0.62500 | 0.62500 | 0.45455 | 0.62500

Figure 5. Average value of Precision, Recall, F-Measure and Accuracy. (Chart
of the average value of each metric.)
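The per-script category frequencies can be computed as simple proportions. The following Python sketch is illustrative only: the function name `category_proportions` is hypothetical, and the example counts are those listed for sentence 1 in the word-level identification table (2 Hindi, 1 English, 0 other, 2 ambiguous, 0 abbreviation words out of 5). Reading the reported per-script accuracy as the combined share of Hindi and English words is an inference from the tabulated values, not something the text states.

```python
from typing import Dict


def category_proportions(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each category's share of the total number of words."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: value / total for name, value in counts.items()}


# Counts for sentence 1 as tabulated: five words in total.
sentence_1 = {"Hindi": 2, "English": 1, "Others": 0, "Amb": 2, "Abb": 0}
proportions = category_proportions(sentence_1)

# Share of words resolved to Hindi or English (assumed reading of "Accuracy").
resolved_share = proportions["Hindi"] + proportions["English"]
```

For this sentence the resolved share is 3 of 5 words, which agrees with the tabulated value of 0.6.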

To understand the overall performance of the proposed system, we summarize
the average precision, recall, accuracy, and F-score. The averages are shown
in Figure 5. The average F-score for the proposed system is 0.7559 and the
accuracy is 0.6927. Hence, the proposed system performs well on this sample
dataset.


Conclusion 
The integration of machine learning techniques has yielded a framework
capable of parsing mixed-script text, thereby introducing an innovative
approach to language detection (Kazi et al., 2020). This research emphasises
the dynamics of script mixing, particularly among Hindi-English bilingual
users on various social media platforms. The framework utilises
sequence-to-sequence models and attention mechanisms for pattern analysis in
both language detection and pattern extraction. Within the proposed system,
the identification of abbreviations (Abb) leverages KB_ABB, while ambiguous
(Amb) words are resolved through a Knowledge Base designed to adapt to user
context. Word-level identification is evaluated using accuracy, precision,
recall, and F-measure. The experiments use a varied dataset that combines
scripts common in user-generated social media material. The dataset includes
text from various social media sources, and every word has been labelled with
one of two languages; the text also contains mixed code, numbers, and special
symbols. Twenty scripts are retained for analysis after preprocessing. Table
4 lists the scripts with their sentence/script numbers, Table 5 summarises
the sentence-level annotations of the sample dataset, and Table 6 gives the
word-level annotations derived from them. The system's performance is
summarised in Figure 5, which includes the average precision, recall,
accuracy, and F-score. The study achieves an accuracy of 0.6927 and an
average F-score of 0.7559. These findings indicate how the suggested method
performs on mixed-script text analysis in the social media environment. With
its ability to handle complex script interactions and different language
patterns, the framework is a step forward for language processing in highly
connected, multilingual societies.